Group Names

1. SHAHZAD AMIN - 1871077

2. HAFIZ MUHAMMAD HASSAN - 1873829

Topic: Stock, Dependency and Graphs

To start working on the homework we have to download stocks data. For this we have first created the CSV file named as stocks.csv which simply contains the names of the comapnies for which we have to collect data. To do this we have also focused on collecting the 10 companies data in each category discussed in homework description. We have also noticed that instead of 10 categories there are 11 categories. Also for this part we have focused on collecting the data from January 1, 2003 through January 1, 2008.

This makes us total of 110 companies categorized into 11 Global Industry Classification Standard (GICS).

Point 2: Select a sensible portfolio of stocks and take data from January 1, 2003 through January 1, 2008, before the onset of the “financial crisis”. Build the data matrix X

Lets assume we have already all the packages. I will include them here.

require(tseries, warn.conflicts=F, quietly = TRUE)
require(igraph, warn.conflicts=F, quietly = TRUE)
require(lemon,warn.conflicts=F, quietly = TRUE)
require(GGally,warn.conflicts=F, quietly = TRUE)
require(energy,warn.conflicts=F, quietly = TRUE)
require(randomcoloR,warn.conflicts=F, quietly = TRUE)

After that we just read the csv file which have rows. We have showed the rows that it contains. We also have to use the lemon library to pretty print the dataframe values.

stock <- read.csv("stocks.csv")
knitr::kable(head(stock), format="markdown")
Symbol Security GICSSector GICS.Sub.Industry Location Date.first.added CIK Founded
A Agilent Technologies Inc Health Care Health Care Equipment Santa Clara, California 6/5/2000 1090872 1999
UNP Union Pacific Industrials Railroads Omaha, Nebraska 100885
AAP Advance Auto Parts Consumer Discretionary Automotive Retail Roanoke, Virginia 7/9/2015 1158449 1932
AAPL Apple Inc. Information Technology Technology Hardware, Storage & Peripherals Cupertino, California 11/30/1982 320193 1977
ABC AmerisourceBergen Corp Health Care Health Care Distributors Chesterbrook, Pennsylvania 8/30/2001 1140859 1985
ABMD ABIOMED Inc Health Care Health Care Equipment Danvers, Massachusetts 5/31/2018 815094 1981

From above we can see that Symbol column contains the companies name and GICSSector contains the category. After that we have downloaded data from January 1, 2003 through January 1, 2008.

symbols <- stock[["Symbol"]]
data <- NULL
options("getSymbols.yahoo.warning"=FALSE)
options("getSymbols.warning4.0"=FALSE)
for (row in symbols) {
  if (is.null(data)) {
    data <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
  }
  else {
    temp <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
    data <- cbind(data, temp)
  }
}
cMatrix <- as.matrix(data)

After downloading the data in cMatrix we have to assign the names to columns and we have to remove the na values and infinite values. We are replacing them with 0 instead of removing the whole row. For getting required X matrix we also have to use diff and log formula on our matrix. I will be outputing its first few rows and columns using kable function.

cMatrix <- as.matrix(data)
colnames(cMatrix) <- symbols
cMatrix[is.na(cMatrix)] <- 0
logMatrix <- diff(log(cMatrix))
logMatrix[is.na(logMatrix)] <- 0
logMatrix[is.infinite(logMatrix)] <- 0
knitr::kable(head(logMatrix[,1:20],6), format="markdown")
A UNP AAP AAPL ABC ABMD ABT ACN ADBE ADI ADM ADP ADS ADSK AEE AEP AES AFL AGN AIG
2003-01-03 -0.0047133 -0.0065585 -0.0155700 0.0067342 0.0149835 0.0231076 0.0096908 0.0074946 0.0269766 0.0042381 0.0063694 -0.0056783 -0.0027816 -0.0284840 0.0136249 0.0200253 0.0628009 0.0003165 0.0193509 -0.0029895
2003-01-06 0.0466631 0.0179333 -0.0659116 0.0000000 0.0118948 0.0275362 0.0064087 0.0184946 0.0449806 0.0596970 0.0015860 0.0027197 -0.0168544 0.0284840 0.0418107 0.0590889 0.0000000 0.0125788 0.0129210 0.0330474
2003-01-07 -0.0090589 0.0001616 0.0056436 -0.0033619 -0.0260110 -0.0849932 -0.0460053 -0.0195618 0.0357053 0.0046974 -0.0143658 -0.0094270 0.0056497 0.0422456 -0.0350673 -0.0175298 -0.0234615 -0.0084733 -0.0033841 -0.0196643
2003-01-08 -0.0497512 -0.0130083 -0.0062969 -0.0204083 -0.0372670 -0.0300159 0.0248996 -0.0069649 -0.0496148 -0.0266682 -0.0072610 -0.0085107 -0.0028208 -0.0105612 0.0018522 -0.0060240 -0.0059524 0.0022037 -0.0164050 -0.0125517
2003-01-09 0.0390776 0.0105821 0.0140589 0.0088943 -0.0128590 0.2314195 0.0032570 -0.0124426 0.0400993 0.0484199 0.0128723 0.0159604 0.0168071 0.0138388 -0.0224571 0.0070246 0.0409414 0.0125002 0.0027529 0.0363914
2003-01-10 0.0167220 -0.0035691 -0.0032269 0.0027219 -0.0113658 -0.0968498 -0.0050151 -0.0822226 0.0144140 0.0205977 -0.0088319 -0.0106952 0.0456105 -0.0245127 -0.0066461 -0.0537525 -0.0379608 -0.0125002 -0.0093217 -0.0095002

After that we plotted logMatrix which will be our required X matrix.

plot(logMatrix)

Above completes point 2 part of the homework in Your Job section.

Lets move to point 3.

Point 3: With this data, consider the usual Pearson correlation coefficient between stocks, and implement the bootstrap procedure described at page 3 of our notes to build marginal correlation graphs. In particular, visualize the dynamic of the graph as ε varies, highlighting the gics sectors with annotation/node color. Draw some conclusion: is there any statistical evidence to support the claim that stocks from the same sector cluster together? Explain.

For this we have to calculate Pearson correlation coefficient between stocks. For this we have to simply use cor function.

pearson_correlation_matrix <- cor(logMatrix, method = "pearson", use = "everything")
R <- pearson_correlation_matrix
R[is.na(R)] <- 0
knitr::kable(head(R[,1:20],6), format="markdown") # outputing the rows
A UNP AAP AAPL ABC ABMD ABT ACN ADBE ADI ADM ADP ADS ADSK AEE AEP AES AFL AGN AIG
A 1.0000000 0.3132293 0.2453006 0.2894047 0.2060193 0.1929294 0.2173697 0.2595466 0.3440663 0.4530926 0.1865611 0.3108620 0.2168602 0.2993274 0.2265353 0.2792155 0.2083448 0.2708431 0.2246741 0.3329415
UNP 0.3132293 1.0000000 0.2409231 0.2394450 0.2275416 0.1805622 0.2706303 0.2816294 0.2722011 0.2881027 0.2555853 0.3005328 0.2097893 0.2803825 0.2796656 0.2944920 0.2480392 0.2920236 0.1999380 0.3494798
AAP 0.2453006 0.2409231 1.0000000 0.1999078 0.1331749 0.1000599 0.1607918 0.1810424 0.1746663 0.2027267 0.1106188 0.1685443 0.1931692 0.1736631 0.1184338 0.1588764 0.1793692 0.2229857 0.1487453 0.2467757
AAPL 0.2894047 0.2394450 0.1999078 1.0000000 0.1270099 0.0927325 0.1895600 0.1360781 0.3016231 0.2960004 0.1349346 0.2564773 0.2002367 0.2784136 0.1937273 0.2194870 0.1332074 0.2081890 0.2015270 0.2546643
ABC 0.2060193 0.2275416 0.1331749 0.1270099 1.0000000 0.1144457 0.2634622 0.1995059 0.1942354 0.1777007 0.1762504 0.2228011 0.1627588 0.1479000 0.2147295 0.2136665 0.1685642 0.1420544 0.2784182 0.2002196
ABMD 0.1929294 0.1805622 0.1000599 0.0927325 0.1144457 1.0000000 0.1759887 0.1675586 0.1482111 0.1929230 0.0936166 0.1393736 0.1013807 0.1779248 0.1643561 0.2037472 0.1463400 0.1504244 0.1378275 0.1819907

From above we can see that columns and rows that have same value have higher dependency on each other usaully 1 as you can see the first row and first column value of “A” company.

Now lets move to calculating Marginal Correlation Graphs.. But first we have to calculating the bootstrap confidence interval for this we have used our own algorithm instead of using Boot function.

#Calculating bootstarp confidence interval
B = 1000 # considering only 1000 iteration for sampling purposes
brep = rep(NA, B)
for (b in 1:B){
  row_idx <- sample(1:nrow(logMatrix), replace = T)
  bSample <- logMatrix[row_idx,] # bootstrap sample
  rownames(bSample) <- rownames(logMatrix)
  sample_Cor <- cor(bSample, method = "pearson", use = "everything") # getting correlation from original matrix for sample
  sample_Cor[is.na(sample_Cor)] <- 0 # replacing values 
  btheta  <- sqrt(nrow(logMatrix))*max(sample_Cor-R)   # bootstrap replicate
  brep[b] <- btheta            # save
}

After above we have our bootstrapped replicated values. Till now we have ∆b After we will calculate the our Emperical CDF.

Gstar<- ecdf(brep)
plot(Gstar)

From above you can see that values graph changes between 0 to 1. Now I have Fn(t) hat. I am going to move t alpha based on 0.95 confidence.

confidence <- as.numeric(quantile(brep,0.95)/sqrt(nrow(logMatrix)))
confidence
## [1] 0.1817225

From above you can notice that its value is 0.18. Lets move to building P􏰂R[j,k]∈Cj,k(α) for all (j,k)􏰃 →n 1−α for this We needed to build new metrix based on this condition “If we have a confidence interval Cn,α then we can put an edge whenever [−ε, +ε] ∩ Cn,α = ∅.”

We wrote function which will create matrix for edges which is based on our confidence interval and error as ee = 0.1

matrix_apply <- function(m,ee,con) {
  m2 <- m
  for (r in seq(nrow(m2))){
    for (c in seq(ncol(m2))){
      l <- m[[r,c]]-con
      h <- m[[r,c]]+con
      eel <- ee*-1
      eeh <- ee
      if(h <= eel || l >= eeh){
        m2[[r,c]]<-1
      }
      else{
        m2[[r,c]]<-0
      }
    }
  }
  return(m2)
}

ee<-0.13
m<-matrix_apply(R,ee,confidence)
knitr::kable(head(m[,1:10],6), format="markdown") 
A UNP AAP AAPL ABC ABMD ABT ACN ADBE ADI
A 1 1 0 0 0 0 0 0 1 1
UNP 1 1 0 0 0 0 0 0 0 0
AAP 0 0 1 0 0 0 0 0 0 0
AAPL 0 0 0 1 0 0 0 0 0 0
ABC 0 0 0 0 1 0 0 0 0 0
ABMD 0 0 0 0 0 1 0 0 0 0

From above we have 1 in the matrix when edge is possible and 0 otherwise. Moving to creating a graph from adjacency matrix as m and simplify it if there are some stocks which points to itself. But before doing I need a function which will color the stocks which has same GICS .

color_graph <- function(graph,stocks,sectors,colors){
  for(i in 1:nrow(stocks)) {
    row <- stocks[i,]
    
    #assigning the colors
    for(j in 1:length(sectors)) {
      if(sectors[j] == row$GICSSector){
        V(graph)[i]$color <- colors[j] # assigning color to each stock but assign same color which are in same sector
        break
      }
    }
  }
  return (graph)
}

Lets plot the graph after using above function for coloring.

ee <- 0.13
sectors <- unique(stock$GICSSector) # getting GICS 
colors <- distinctColorPalette(length(sectors))  # getting colors based on GICS 
m <- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock,sectors,colors))
legend("bottomleft", legend= sectors  , col = colors , bty = "n", pch=20 , pt.cex = 0.7, cex = 0.35, text.col=colors , horiz = FALSE, inset = c(0.1, 0.1))

From above you can see that with error 0.13 we get the perfect result of what we have wanted to support our claim that “stocks from the same sector cluster together”. To support this claim we have drawn the graph containing unique color for stocks that are in the same GICS

We will going to run this with different error values to see what will happen to our stocks when we increase error value ee

eeee<- 0.1
for( i in 1:10){
ee <-ee + 0.02
m<- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
title = paste("Error: ",ee," and Confidence Level: ",round(0.95,2),sep = "")
plot(color_graph(g,stock,sectors,colors) ,main="")
title(main =title,  line = 0.5, cex.main = 0.4)
}

From above you can see that our confidence value wa 0.18 and as we increase our error value from 0.10 to 0.37 stocks from same GICS started spreading instead of merging on same clustor. But still you can see that stocks from same GICS always clustered togather this proves our hypothesis.

Now Lets move to Point 4.

Point 4: Again with the data in X, we now want to build a marginal correlation graph based on γ2, the distance covariance (seethe Appendix in our notes). This time we don’t have a confidence interval available, hence we will simply go for amultiple hypothesis testing (with and without Bonferroni correction) placing an edge {j, k} between stock j and stock k if and only if we reject the null hypothesis that γ2 = 0. Use the functions in the package energy to perform these tests, i,j then build and visualize the graph commenting on the results.

To start working on this we need to decrease number of stocks otherwise the result are not statisfactory and the plot we have is not clustored togather.

-> We have decrease the size of the stocks, considering only 4 stocks from each category.

options("getSymbols.yahoo.warning"=FALSE)
options("getSymbols.warning4.0"=FALSE)
stock <- read.csv("stocks_short.csv")
symbols <- stock[["Symbol"]]

data <- NULL
for (row in symbols) {
  if (is.null(data)) {
    data <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
  }
  else {
    temp <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
    data <- cbind(data, temp)
  }
}
cMatrix <- as.matrix(data)
colnames(cMatrix) <- symbols
cMatrix[is.na(cMatrix)] <- 0
logMatrix <- diff(log(cMatrix))
logMatrix[is.na(logMatrix)] <- 0
plot(logMatrix)

Above steps are the same for creating our Matrix X. I have just changed the CSV file to have only 34 companies names instead of 110 as I have did before.

# created empty correlation matrix 
cor_mat <- matrix(1:ncol(logMatrix)**2,nrow=ncol(logMatrix),dimnames=list(colnames(logMatrix),colnames(logMatrix)))
logMatrix[is.na(logMatrix)] <- 0 # na values 
logMatrix[is.infinite(logMatrix)] <- 0 # removing infinite values.
alpha <- 0.05 # considering alpha directly instead of calculating it from confidence level
for(i in seq(nrow(cor_mat))){
  for(j in seq(ncol(cor_mat))){
    X <- logMatrix[,i]
    Y <- logMatrix[,j]
    dd <- dcov.test(X,Y,index=0.01,R=100) # this test will give us distance covariance value. This will do all the testing as behind the scene as described in class
    cor_mat[i,j] <- dd$p.value
  }
}

After above code will run we will have cor_mat which contains our desired values. But it takes around 2 hours to run. So we have already saved it to Q2Final.RData file which has already that matrix.

load("Q2Final.RData")

For plotting we need again the matrix_apply function for getting edges for following this condition on matrix each value “if and only if we reject the null hypothesis that γ2 = 0.”. So

matrix_apply_alpha <- function(m,alpha) {
  m2 <- m
  for (r in seq(nrow(m2))){
    for (c in seq(ncol(m2))){
      val <- m[[r,c]]
      if(val>=alpha){
        m2[[r,c]]<-1
      }
      else{
        m2[[r,c]]<-0
      }
    }
  }
  
  return(m2)
}

color_graph <- function(graph,stocks,sectors,colors){
  for(i in 1:nrow(stocks)) {
    row <- stocks[i,]
    
    #assigning the colors
    for(j in 1:length(sectors)) {
      if(sectors[j] == row$GICSSector){
        V(graph)[i]$color <- colors[j] # assigning color to each stock but assign same color which are in same sector
        break
      }
    }
  }
  return (graph)
}
sectors <- unique(stock$GICSSector) # getting GICS 
colors <- distinctColorPalette(length(sectors))  # getting colors based on GICS 
m <- matrix_apply_alpha(cor_mat,alpha)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock,sectors,colors) ,main="")
title(main ="Marginal Correlation Graph",  line = 0.5, cex.main = 0.4)

Finally, we have all the things we need. Lets move to point 5.

Finally, if possible using the same portfolio of stocks, grab data from January 1, 2013 through January 1, 2018. Build the new matrix Y = 􏰀yt,j􏰁 and repeat the previous analysis commenting on observed differences between the two t,j time-frames.

Now, for this part we have focused on collecting the data from January 1, 2013 through January 1, 2018.

This makes us total of 110 companies categorized into 11 Global Industry Classification Standard (GICS).

Point 2: Select a sensible portfolio of stocks and take data from January 1, 2013 through January 1, 2018, before the onset of the “financial crisis”. Build the data matrix X

stock <- read.csv("stocks.csv")
symbols <- stock[["Symbol"]]
data <- NULL
options("getSymbols.yahoo.warning"=FALSE)
options("getSymbols.warning4.0"=FALSE)
for (row in symbols) {
  if (is.null(data)) {
    data <- suppressWarnings(get.hist.quote(instrument = row, start = "2013-01-01", end = "2018-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
  }
  else {
    temp <- suppressWarnings(get.hist.quote(instrument = row, start = "2013-01-01", end = "2018-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
    data <- cbind(data, temp)
  }
}
cMatrix <- as.matrix(data)

After downloading the data in cMatrix we have to assign the names to columns and we have to remove the na values and infinite values. We are replacing them with 0 instead of removing the whole row. For getting required X matrix we also have to use diff and log formula on our matrix. I will be outputing its first few rows and columns using kable function.

cMatrix <- as.matrix(data)
colnames(cMatrix) <- symbols
cMatrix[is.na(cMatrix)] <- 0
logMatrix <- diff(log(cMatrix))
logMatrix[is.na(logMatrix)] <- 0
logMatrix[is.infinite(logMatrix)] <- 0
knitr::kable(head(logMatrix[,1:20],6), format="markdown")
A UNP AAP AAPL ABC ABMD ABT ACN ADBE ADI ADM ADP ADS ADSK AEE AEP AES AFL AGN AIG
2013-01-03 0.0035753 0.0014002 0.0000000 -0.0127026 -0.0020716 -0.0176865 0.0373589 -0.0036266 -0.0155083 -0.0162679 -0.0080546 0.0039313 0.0080194 -0.0155507 -0.0041593 -0.0006876 -0.0054695 -0.0256640 0.0201029 -0.0082577
2013-01-04 0.0195554 0.0173395 0.0154682 -0.0282499 0.0066597 -0.0029784 -0.0060296 0.0055073 0.0100159 -0.0179471 0.0270567 0.0088316 0.0057079 -0.0002749 0.0000000 -0.0016061 0.0234880 -0.0126028 -0.0072421 0.0033112
2013-01-07 -0.0072591 -0.0047480 -0.0034016 -0.0058997 0.0031993 -0.0120031 0.0081314 -0.0043454 -0.0049955 0.0030528 -0.0422918 -0.0038967 0.0216695 -0.0074535 -0.0125829 -0.0041417 -0.0317487 -0.0038506 0.0042115 -0.0102451
2013-01-08 -0.0080227 -0.0006912 -0.0164908 0.0026878 -0.0013699 -0.0030234 0.0002998 0.0057896 0.0052576 -0.0103702 0.0117126 0.0057549 0.0051677 0.0058019 0.0045352 -0.0050855 0.0000000 0.0115076 0.0118372 -0.0078234
2013-01-09 0.0266496 0.0054392 0.0030437 -0.0157523 0.0000000 -0.0284088 0.0065751 0.0070468 0.0135419 -0.0026094 0.0049279 -0.0025349 0.0023168 0.0101412 0.0045146 0.0002317 0.0045977 0.0011435 0.0089584 0.0030807
2013-01-10 0.0073547 0.0015268 -0.0087409 0.0123198 0.0027378 -0.0031201 0.0083061 -0.0087802 -0.0010352 0.0120413 -0.0049279 0.0037156 -0.0047681 -0.0040989 0.0127879 0.0089955 0.0172810 0.0164356 0.0060416 0.0011180

After that we plotted logMatrix which will be our required X matrix.

plot(logMatrix)

Comparison: If we look at result of this matrix you can clearly see that it x-axis is between -0.05 to 0.05. In new matrix, we values are more desprerse on X-axis and previous data values are more despersed in Y-axis. but both of the graph are concenterated over 0.

Above completes point 2 part of the homework in Your Job section for data 20013 to 2018

Lets move to point 3.

Point 3: With this data, consider the usual Pearson correlation coefficient between stocks, and implement the bootstrap procedure described at page 3 of our notes to build marginal correlation graphs. In particular, visualize the dynamic of the graph as ε varies, highlighting the gics sectors with annotation/node color. Draw some conclusion: is there any statistical evidence to support the claim that stocks from the same sector cluster together? Explain.

For this we have to calculate Pearson correlation coefficient between stocks. For this we have to simply use cor function.

pearson_correlation_matrix <- cor(logMatrix, method = "pearson", use = "everything")
R <- pearson_correlation_matrix
R[is.na(R)] <- 0
knitr::kable(head(R[,1:20],6), format="markdown") # outputing the rows
A UNP AAP AAPL ABC ABMD ABT ACN ADBE ADI ADM ADP ADS ADSK AEE AEP AES AFL AGN AIG
A 1.0000000 0.4265864 0.2084390 0.3028485 0.2831098 0.2293371 0.5175138 0.4815826 0.4311880 0.4734604 0.3342113 0.4471464 0.3796908 0.4548649 0.2030784 0.1914714 0.3294797 0.4427146 0.3577668 0.4766110
UNP 0.4265864 1.0000000 0.2594759 0.2688674 0.2413006 0.1741804 0.3541362 0.3972991 0.3231438 0.4270688 0.3507299 0.4396891 0.3737763 0.3100782 0.1960577 0.1839721 0.2933191 0.4613683 0.2361382 0.4359570
AAP 0.2084390 0.2594759 1.0000000 0.1072128 0.1539561 0.1245225 0.2647568 0.2037882 0.1600790 0.1842714 0.2246974 0.2555626 0.2386067 0.1284772 0.1418276 0.1369821 0.1965649 0.2605496 0.1791216 0.2446367
AAPL 0.3028485 0.2688674 0.1072128 1.0000000 0.1409962 0.1809675 0.2584548 0.2724068 0.2726530 0.3789044 0.1659758 0.2545354 0.2133226 0.2794214 0.1002925 0.1325608 0.1745371 0.2492528 0.1834857 0.2946559
ABC 0.2831098 0.2413006 0.1539561 0.1409962 1.0000000 0.1746143 0.3612115 0.2360283 0.2096561 0.2048848 0.1725127 0.3267634 0.2718505 0.2205016 0.1087532 0.1189980 0.1478877 0.2732205 0.3849018 0.3044349
ABMD 0.2293371 0.1741804 0.1245225 0.1809675 0.1746143 1.0000000 0.2632927 0.2081403 0.2409888 0.2689047 0.1479566 0.2459297 0.2099626 0.2502158 0.0931469 0.0770499 0.1131560 0.1776666 0.1499866 0.1855646

From above we can see that columns and rows that have same value have higher dependency on each other usaully 1 as you can see the first row and first column value of “A” company.

Now Comparison: If we compare this matrix values to previous ones we can say that values are almost in .1 difference. So one can conlude that some of the companies values decreased and for some it is increase. Just by looking at Pearson Correlation values for each stock.

Lets move to calculating Marginal Correlation Graphs.. But first we have to calculating the bootstrap confidence interval for this we have used our own algorithm instead of using Boot function.

#Calculating bootstarp confidence interval
B = 1000 # considering only 1000 iteration for sampling purposes
brep = rep(NA, B)
for (b in 1:B){
  row_idx <- sample(1:nrow(logMatrix), replace = T)
  bSample <- logMatrix[row_idx,] # bootstrap sample
  rownames(bSample) <- rownames(logMatrix)
  sample_Cor <- cor(bSample, method = "pearson", use = "everything") # getting correlation from original matrix for sample
  sample_Cor[is.na(sample_Cor)] <- 0 # replacing values 
  btheta  <- sqrt(nrow(logMatrix))*max(sample_Cor-R)   # bootstrap replicate
  brep[b] <- btheta            # save
}

After above we have our bootstrapped replicated values. Till now we have ∆b After we will calculate the our Emperical CDF.

Gstar<- ecdf(brep)
plot(Gstar)

From above you can see that values graph changes between 0 to 1. Comparison: Look like both of line graph is almost the same.

Now I have Fn(t) hat. I am going to move t alpha based on 0.95 confidence.

confidence <- as.numeric(quantile(brep,0.95)/sqrt(nrow(logMatrix)))
confidence
## [1] 0.1838335

From above you can notice that its value is 0.18.

Comparison: Almost same value we compare both confidence values then 0.1827636 previous value and 0.1849653 is current one.

Lets move to building P􏰂R[j,k]∈Cj,k(α) for all (j,k)􏰃 →n 1−α for this We needed to build new metrix based on this condition “If we have a confidence interval Cn,α then we can put an edge whenever [−ε, +ε] ∩ Cn,α = ∅.”

We wrote function which will create matrix for edges which is based on our confidence interval and error as ee = 0.1

matrix_apply <- function(m,ee,con) {
  m2 <- m
  for (r in seq(nrow(m2))){
    for (c in seq(ncol(m2))){
      l <- m[[r,c]]-con
      h <- m[[r,c]]+con
      eel <- ee*-1
      eeh <- ee
      if(h <= eel || l >= eeh){
        m2[[r,c]]<-1
      }
      else{
        m2[[r,c]]<-0
      }
    }
  }
  return(m2)
}

ee<-0.13
m<-matrix_apply(R,ee,confidence)
knitr::kable(head(m[,1:10],6), format="markdown") 
A UNP AAP AAPL ABC ABMD ABT ACN ADBE ADI
A 1 1 0 0 0 0 1 1 1 1
UNP 1 1 0 0 0 0 1 1 1 1
AAP 0 0 1 0 0 0 0 0 0 0
AAPL 0 0 0 1 0 0 0 0 0 1
ABC 0 0 0 0 1 0 1 0 0 0
ABMD 0 0 0 0 0 1 0 0 0 0

Comparison: No significant difference found in values. Although we will can notice that few values for edges changed.

From above we have 1 in the matrix when edge is possible and 0 otherwise. Moving to creating a graph from adjacency matrix as m and simplify it if there are some stocks which points to itself. But before doing I need a function which will color the stocks which has same GICS .

color_graph <- function(graph,stocks,sectors,colors){
  for(i in 1:nrow(stocks)) {
    row <- stocks[i,]
    
    #assigning the colors
    for(j in 1:length(sectors)) {
      if(sectors[j] == row$GICSSector){
        V(graph)[i]$color <- colors[j] # assigning color to each stock but assign same color which are in same sector
        break
      }
    }
  }
  return (graph)
}

Lets plot the graph after using above function for coloring.

ee <- 0.13
sectors <- unique(stock$GICSSector) # getting GICS 
colors <- distinctColorPalette(length(sectors))  # getting colors based on GICS 
m <- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock,sectors,colors))
legend("bottomleft", legend= sectors  , col = colors , bty = "n", pch=20 , pt.cex = 0.7, cex = 0.35, text.col=colors , horiz = FALSE, inset = c(0.1, 0.1))

From above you can see that with error 0.13 we get the perfect result of what we have wanted to support our claim that “stocks from the same sector cluster together”. To support this claim we have drawn the graph containing unique color for stocks that are in the same GICS

Comparison For previous years some companies are more despersed then current years from 2013 to 1018. You can clearly see onlt BBY and CM was out from the clustor but previously we had severals like CMG, AGN etc.

We will going to run this with different error values to see what will happen to our stocks when we increase error value ee

eeee<- 0.1
for( i in 1:10){
ee <-ee + 0.02
m<- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
title = paste("Error: ",ee," and Confidence Level: ",round(0.95,2),sep = "")
plot(color_graph(g,stock,sectors,colors) ,main="")
title(main =title,  line = 0.5, cex.main = 0.4)
}

From above you can see that our confidence value wa 0.18 and as we increase our error value from 0.10 to 0.37 stocks from same GICS started spreading instead of merging on same clustor. But still you can see that stocks from same GICS always clustered togather this proves our hypothesis.

Comparison: If we compare graphs and values initially for 2013 to 2018 stock values are more dependent and clustored togather than 2003 to 2008 stocks values.

Now Lets move to Point 4.

Point 4: Again with the data in X, we now want to build a marginal correlation graph based on γ2, the distance covariance (seethe Appendix in our notes). This time we don’t have a confidence interval available, hence we will simply go for amultiple hypothesis testing (with and without Bonferroni correction) placing an edge {j, k} between stock j and stock k if and only if we reject the null hypothesis that γ2 = 0. Use the functions in the package energy to perform these tests, i,j then build and visualize the graph commenting on the results.

To start working on this we need to decrease number of stocks otherwise the result are not statisfactory and the plot we have is not clustored togather.

-> We have decrease the size of the stocks, considering only 4 stocks from each category.

options("getSymbols.yahoo.warning"=FALSE)
options("getSymbols.warning4.0"=FALSE)
stock <- read.csv("stocks_short.csv")
symbols <- stock[["Symbol"]]

data <- NULL
for (row in symbols) {
  if (is.null(data)) {
    data <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
  }
  else {
    temp <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
    data <- cbind(data, temp)
  }
}
cMatrix <- as.matrix(data)
colnames(cMatrix) <- symbols
cMatrix[is.na(cMatrix)] <- 0
logMatrix <- diff(log(cMatrix))
logMatrix[is.na(logMatrix)] <- 0
plot(logMatrix)

Above steps are the same for creating our Matrix X. I have just changed the CSV file to have only 34 companies names instead of 110 as I have did before.

# created empty correlation matrix 
cor_mat <- matrix(1:ncol(logMatrix)**2,nrow=ncol(logMatrix),dimnames=list(colnames(logMatrix),colnames(logMatrix)))
logMatrix[is.na(logMatrix)] <- 0 # na values 
logMatrix[is.infinite(logMatrix)] <- 0 # removing infinite values.
alpha <- 0.05 # considering alpha directly instead of calculating it from confidence level
for(i in seq(nrow(cor_mat))){
  for(j in seq(ncol(cor_mat))){
    X <- logMatrix[,i]
    Y <- logMatrix[,j]
    dd <- dcov.test(X,Y,index=0.01,R=100) # this test will give us distance covariance value. This will do all the testing as behind the scene as described in class
    cor_mat[i,j] <- dd$p.value
  }
}

After above code will run we will have cor_mat which contains our desired values. But it takes around 2 hours to run. So we have already saved it to Q2Final.RData file which has already that matrix.

load("Q2Final.RData")

For plotting we need again the matrix_apply function for getting edges for following this condition on matrix each value “if and only if we reject the null hypothesis that γ2 = 0.”. So

matrix_apply_alpha <- function(m,alpha) {
  m2 <- m
  for (r in seq(nrow(m2))){
    for (c in seq(ncol(m2))){
      val <- m[[r,c]]
      if(val>=alpha){
        m2[[r,c]]<-1
      }
      else{
        m2[[r,c]]<-0
      }
    }
  }
  
  return(m2)
}
color_graph <- function(graph,stocks,sectors,colors){
  for(i in 1:nrow(stocks)) {
    row <- stocks[i,]
    
    #assigning the colors
    for(j in 1:length(sectors)) {
      if(sectors[j] == row$GICSSector){
        V(graph)[i]$color <- colors[j] # assigning color to each stock but assign same color which are in same sector
        break
      }
    }
  }
  return (graph)
}
sectors <- unique(stock$GICSSector) # getting GICS 
colors <- distinctColorPalette(length(sectors))  # getting colors based on GICS 
m <- matrix_apply_alpha(cor_mat,alpha)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock,sectors,colors) ,main="")
title(main ="Marginal Correlation Graph",  line = 0.5, cex.main = 0.4)

Finally, we have all the things we need. Lets move to point 5.